Since we do not see a trend, we consider the time series to be stationary (i.e. their statistical properties do not depend on the time of the observation).
The plot below does not reveal a specific seasonal pattern, so search interest in these topics does not appear to depend on the time of year. There are, however, some large peaks at certain times, and most of them are linked to the British royal family. These peaks could be due to isolated events, such as news about the well-being of the Queen or the release of the TV series The Crown, which could spark interest in the royal family. Searches for the Summer Olympics are high in the run-up to the event and while it takes place and, as expected, fall drastically once the Games have ended.
We have made a bar plot depicting the frequency of views and searches on a given topic/type.
\[ R_i = \frac{X_i - X_{i-7}}{X_{i-7}} \times 100 \] \(R_i\) is the relative weekly change (increase or decrease) in percent, where \(X_i\) is the count on day \(i\) and \(X_{i-7}\) the count seven days earlier.
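As an illustration of the formula above, the following minimal Python sketch computes the relative weekly change for a short series. It is not the code used for this report: the function name `relative_weekly_change`, the handling of zero counts, and the sample data are assumptions made for the example.

```python
def relative_weekly_change(counts, lag=7):
    """Return R_i = (X_i - X_{i-lag}) / X_{i-lag} * 100 for every i >= lag.

    Entries whose reference value X_{i-lag} is zero are returned as None,
    because the ratio is undefined there (an assumption of this sketch).
    """
    changes = []
    for i in range(lag, len(counts)):
        previous = counts[i - lag]
        if previous == 0:
            changes.append(None)
        else:
            changes.append((counts[i] - previous) / previous * 100)
    return changes


# Hypothetical daily view counts for two weeks (not taken from the report).
daily_views = [100, 120, 90, 110, 105, 95, 130,
               150, 120, 45, 110, 210, 95, 65]
weekly_change = relative_weekly_change(daily_views)
print(weekly_change)
```

The first seven days have no value because there is no observation seven days earlier to compare with, so the output is seven entries shorter than the input.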
Plotting the relative weekly change shows the highs and lows more clearly. There is a massive peak in the traffic count in April 2016 for Princess Margaret and another large, though smaller, peak for Winston Churchill in February 2016.
For each time series (\(R_i,\ i = 1, ..., 12\)), we plot the distribution of its daily counts on a Q-Q plot to assess whether the data come from a theoretical distribution (e.g. Normal). A Q-Q plot compares the quantiles of a given sample with those of the normal distribution. The geom_line(…, distribution = stats::qqnorm) function is used to check the normality assumption visually. If the assumption does not hold, we look into which data points contribute to the violation and how.
Results: on all the Q-Q plots below, we observe that the daily count distributions follow a normal distribution in the bulk but depart from it in the tail, indicating that the data are skewed. More precisely, they are right-skewed (i.e. positively skewed).
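To make the idea behind a Q-Q plot concrete, here is a minimal Python sketch that builds the pairs of theoretical and observed quantiles by hand. It is an illustration only and not the ggplot2 code used for the report. The plotting positions \((k - 0.5)/n\), the function name `normal_qq_points`, and the sample are assumptions of the example.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Reference normal distribution fitted with the sample mean and std. dev.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (k - 0.5) / n stay strictly inside (0, 1).
    return [(reference.inv_cdf((k - 0.5) / n), x)
            for k, x in enumerate(ordered, start=1)]


# Hypothetical right-skewed sample: mostly moderate values plus two spikes.
sample = [12, 15, 14, 13, 16, 15, 14, 17, 13, 15, 60, 95]
points = normal_qq_points(sample)

# In the upper tail the observed quantile lies above the reference line y = x,
# which is the departure from normality described in the text.
theoretical_max, observed_max = points[-1]
print(observed_max > theoretical_max)
```

If the sample were normal, the points would lie close to the line where observed and theoretical quantiles are equal; a right-skewed sample bends upward at the high end.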
Knowing that the data are right-skewed, we generate the Mean Residual Life plot of each time series using the mrlplot() function from the extRemes package, which plots the mean excess against candidate thresholds. The following plots are used to choose the most adequate thresholds: the value of \(u\) from which the plot becomes approximately linear can generally be selected as the threshold.
After setting the threshold of each series, we draw its Peaks-Over-Threshold (POT) plot with the threshold \(u\) to display the exceedances. Note that if the threshold is too low, the asymptotic justification of the extreme value approach no longer holds, and if it is too high, too few exceedances remain to obtain insightful results.
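The quantity shown on a mean residual life plot can be sketched in a few lines of Python. This is a hedged illustration of the mean excess only, not a reimplementation of mrlplot(): the function name `mean_excess`, the candidate thresholds, and the data are assumptions of the example.

```python
def mean_excess(values, threshold):
    """Average of (x - threshold) over the observations above threshold."""
    excesses = [x - threshold for x in values if x > threshold]
    if not excesses:
        return None  # no observation exceeds the threshold
    return sum(excesses) / len(excesses)


# Hypothetical daily counts (not taken from the report).
counts = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
candidate_thresholds = [0, 20, 40, 60, 80]

# One (u, mean excess) pair per candidate threshold: the points of the plot.
mrl_curve = [(u, mean_excess(counts, u)) for u in candidate_thresholds]
print(mrl_curve)
```

Reading the resulting pairs from left to right, one looks for the smallest \(u\) beyond which the curve is roughly a straight line. At high thresholds very few observations remain, so the estimates there become unstable.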
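The trade-off between a threshold that is too low and one that is too high can be checked by counting exceedances. The Python sketch below is illustrative only: the helper names and the series are made up, and the value 125 is used purely as an example threshold.

```python
def exceedances(values, threshold):
    """Return (index, value) pairs for observations strictly above threshold."""
    return [(i, x) for i, x in enumerate(values) if x > threshold]


def exceedance_share(values, threshold):
    """Fraction of observations that lie above threshold."""
    return len(exceedances(values, threshold)) / len(values)


# Hypothetical series of relative weekly changes, in percent.
series = [5, -10, 140, 20, 0, 300, -30, 15, 126, 8]
u = 125  # example threshold

peaks = exceedances(series, u)
print(peaks)                        # positions and sizes of the exceedances
print(exceedance_share(series, u))  # share of the series above u
```

A share close to zero suggests too few points to fit a tail model, while a large share suggests that the threshold cuts into the bulk of the distribution.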
The selected threshold \(u\) is 125.
The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The selected threshold \(u\) is 35.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.
The selected threshold \(u\) is 100.
The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The selected threshold \(u\) is 30.
The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The selected threshold \(u\) is 25.
The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The selected threshold \(u\) is 200.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.
The selected threshold \(u\) is 50.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.
The selected threshold \(u\) is 700.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.
The selected threshold \(u\) is 400.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.
The selected threshold \(u\) is 100.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.
The selected threshold \(u\) is 70.
The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The selected threshold \(u\) is 60.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.
| | quantile99 | uncertainty |
|---|---|---|
| 2016_Summer_Olympics | 143.5829 | 179.0603 |
| Diana,_Princess_of_Wales | 363.1843 | 492.1521 |
| Elizabeth_II | 642.9873 | 1753.4239 |
| George_VI | 520.8474 | 10635.0359 |
| Prince_Philip,_Duke_of_Edinburgh | 1111.279 | 1579.921 |
| Princess_Margaret,_Countess_of_Snowdon | 973.8326 | 17545.2920 |
| Queen_Victoria | 421.7058 | 544.1161 |
| United_Kingdom | 79.08561 | 1784.17935 |
| United_States | 62.24281 | 266.52798 |
| Winston_Churchill | 846.2276 | 7139.2646 |
| World_War_I | 94.64611 | 77.44259 |
| World_War_II | 61.75727 | 48.12695 |
The graphical method that we suggest for detecting simultaneous high traffic loads is an interactive Block Maxima plot colored by topic. It makes it possible to compare the daily maxima of the 12 web pages visually. If, on one or several days, one page displays a high maximum while the values of other pages lie just below it, this indicates a simultaneous high load.
Hovering the mouse cursor over a data point reveals more information. With this plot, we can observe that Princess Margaret, George VI, and Prince Philip, Duke of Edinburgh have the largest simultaneous high loads around November 6, 2016, and another, smaller simultaneous high load on April 21, 2016. The recommendation for Wikipedia is therefore that, when the load on one of these pages increases, the other pages should be cached as well to avoid exhausting the available resources.
For the numerical method, we compute a matrix of tail dependence coefficients between the web pages. We are interested in the values closest to 1: they indicate that, when one web page shows an extreme value in its upper tail, the other page is very likely to show an extreme value in its upper tail as well. In that case there is tail dependence, and a simultaneous high load on both (or more than two) pages is likely. Wikipedia should then anticipate it and apply caching to these pages.
| | 2016_Summer_Olympics | Diana,_Princess_of_Wales | Elizabeth_II | George_VI | Prince_Philip,_Duke_of_Edinburgh | Princess_Margaret,_Countess_of_Snowdon | Queen_Victoria | United_Kingdom | United_States | Winston_Churchill | World_War_I | World_War_II |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016_Summer_Olympics | 1.0000000 | 0.0714286 | 0.0000000 | 0.0714286 | 0.0357143 | 0.1071429 | 0.0357143 | 0.1071429 | 0.2142857 | 0.0357143 | 0.0000000 | 0.0357143 |
| Diana,_Princess_of_Wales | 0.0714286 | 1.0000000 | 0.0714286 | 0.0714286 | 0.1428571 | 0.0714286 | 0.0714286 | 0.0357143 | 0.0357143 | 0.0000000 | 0.0000000 | 0.0357143 |
| Elizabeth_II | 0.0000000 | 0.0714286 | 1.0000000 | 0.4642857 | 0.6071429 | 0.4642857 | 0.2857143 | 0.1071429 | 0.0000000 | 0.0357143 | 0.1428571 | 0.0000000 |
| George_VI | 0.0714286 | 0.0714286 | 0.4642857 | 1.0000000 | 0.5357143 | 0.5714286 | 0.3214286 | 0.0714286 | 0.0357143 | 0.0714286 | 0.1071429 | 0.0357143 |
| Prince_Philip,_Duke_of_Edinburgh | 0.0357143 | 0.1428571 | 0.6071429 | 0.5357143 | 1.0000000 | 0.5357143 | 0.2142857 | 0.0714286 | 0.0000000 | 0.0357143 | 0.1071429 | 0.0357143 |
| Princess_Margaret,_Countess_of_Snowdon | 0.1071429 | 0.0714286 | 0.4642857 | 0.5714286 | 0.5357143 | 1.0000000 | 0.2142857 | 0.0714286 | 0.0714286 | 0.0000000 | 0.1071429 | 0.0357143 |
| Queen_Victoria | 0.0357143 | 0.0714286 | 0.2857143 | 0.3214286 | 0.2142857 | 0.2142857 | 1.0000000 | 0.0357143 | 0.0000000 | 0.0000000 | 0.1071429 | 0.0000000 |
| United_Kingdom | 0.1071429 | 0.0357143 | 0.1071429 | 0.0714286 | 0.0714286 | 0.0714286 | 0.0357143 | 1.0000000 | 0.0357143 | 0.1785714 | 0.1071429 | 0.1071429 |
| United_States | 0.2142857 | 0.0357143 | 0.0000000 | 0.0357143 | 0.0000000 | 0.0714286 | 0.0000000 | 0.0357143 | 1.0000000 | 0.0714286 | 0.1785714 | 0.1785714 |
| Winston_Churchill | 0.0357143 | 0.0000000 | 0.0357143 | 0.0714286 | 0.0357143 | 0.0000000 | 0.0000000 | 0.1785714 | 0.0714286 | 1.0000000 | 0.1071429 | 0.0714286 |
| World_War_I | 0.0000000 | 0.0000000 | 0.1428571 | 0.1071429 | 0.1071429 | 0.1071429 | 0.1071429 | 0.1071429 | 0.1785714 | 0.1071429 | 1.0000000 | 0.4285714 |
| World_War_II | 0.0357143 | 0.0357143 | 0.0000000 | 0.0357143 | 0.0357143 | 0.0357143 | 0.0000000 | 0.1071429 | 0.1785714 | 0.0714286 | 0.4285714 | 1.0000000 |
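One simple way to obtain a matrix of this kind is a count-based estimate: for each pair of pages, take the \(k\) most extreme days of each page and compute the share of days they have in common. The Python sketch below illustrates that idea under stated assumptions. It is not necessarily the estimator used for the table above, and the function names, the value of `k`, and the data are hypothetical.

```python
def top_k_days(series, k):
    """Indices of the k largest observations (ties go to the earliest day)."""
    order = sorted(range(len(series)), key=lambda i: (-series[i], i))
    return set(order[:k])


def tail_dependence(x, y, k):
    """Share of x's k most extreme days that are also among y's k most extreme."""
    return len(top_k_days(x, k) & top_k_days(y, k)) / k


# Hypothetical daily counts for three pages over ten days.
pages = {
    "a": [10, 12, 90, 11, 95, 13, 80, 12, 11, 10],
    "b": [20, 21, 85, 22, 99, 20, 23, 70, 21, 22],
    "c": [50, 5, 6, 60, 7, 8, 9, 7, 55, 6],
}
k = 3  # number of tail observations per page (assumed for the example)

# Symmetric matrix with ones on the diagonal, as in the table above.
matrix = {p: {q: tail_dependence(pages[p], pages[q], k) for q in pages}
          for p in pages}
for name, row in matrix.items():
    print(name, row)
```

In this toy example pages "a" and "b" share two of their three most extreme days, so their coefficient is high, while page "c" peaks on different days and gets a coefficient of zero with both. Because both pages use the same `k`, the estimate is symmetric.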